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By Tyler H. McCormick, Cynthia Rudin and David Madigan 

University of Washington, Massachusetts Institute of Technology 
and Columbia University 

We propose a statistical modeling technique, called the Hierarchi- 
cal Association Rule Model (HARM), that predicts a patient's pos- 
sible future medical conditions given the patient's current and past 
history of reported conditions. The core of our technique is a Bayesian 
hierarchical model for selecting predictive association rules (such as 
"condition 1 and condition 2 — ► condition 3") from a large set of 
candidate rules. Because this method "borrows strength" using the 
conditions of many similar patients, it is able to provide predictions 
specialized to any given patient, even when little information about 
the patient's history of conditions is available. 



1. Introduction. The emergence of large-scale medical record databases 
presents exciting opportunities for data-based personalized medicine. Pre- 
diction lies at the heart of personalized medicine and in this paper we 
propose a statistical model for predicting patient-level sequences of med- 
ical conditions. We draw on new approaches for predicting the next event 
within a "current sequence," given a "sequence database" of past event 
sequences [Rudin et al. (2011a, 2011b)]. Specifically, we propose the Hierar- 
chical Association Rule Model (HARM) that generates a set of association 
rules such as dyspepsia and epigastric pain heartburn, indicating that 
dyspepsia and epigastric pain are commonly followed by heartburn. HARM 
produces a ranked list of these association rules. Patients and caregivers can 
use the rules to guide medical decisions [see Hood and Friend (2011), e.g.], 
while systems can use predictions to allocate resources [Vogenberg (2009)]. 
Built-in explanations represent a particular advantage of the association 
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rule framework — the rule predicts heartburn because the patient has had 
dyspepsia and epigastric pain. 

In our setup, we assume that each patient visits a healthcare provider 
periodically. At each encounter, the provider records time-stamped medical 
conditions experienced since the previous encounter. In this context, we 
address several prediction problems such as the following: 

• Given data from a sequence of past encounters, predict the next condition 
that a patient will report. 

• Given basic demographic information, predict the first condition that a pa- 
tient will report. 

• Given partial data from an encounter (and possibly prior encounters), 
predict the next condition. 

Though medical databases often contain records from thousands or even 
millions of patients, most patients experience only a handful of the massive 
set of potential conditions. This patient-level sparsity presents a challenge 
for predictive modeling. Our hierarchical modeling approach attempts to 
address this challenge by borrowing strength across patients. 

The sequential event prediction problem is new, a supervised learning 
problem that has been formalized here and by Rudin et al. (2011a, 2011b). 
Rules are particularly useful in our context: rules yield very interpretable 
models, and their conditional probabilities involve few variables and are thus 
more reliable to estimate. 

The experiments this paper presents indicate that HARM outperforms 
several baseline approaches, including a standard "maximum confidence, 
minimum support threshold" technique used in association rule mining, and 
also a nonhierarchical version of our Bayesian method [from Rudin et al. 
(2011a, 2011b)] that ranks rules using "adjusted confidence." 

More generally, HARM yields a prediction algorithm for sequential data 
that can potentially be used for a wide variety of applications beyond con- 
dition prediction. For instance, the algorithm can be directly used as a rec- 
ommender system (e.g., for vendors such as Netflix, amazon.com or online 
grocery stores such as Fresh Direct and Peapod). It can be used to predict 
the next move in a video game in order to design a more interesting game, 
or it can be used to predict the winners at each round of a tournament (e.g., 
the winners of games in a football season). All of these applications possess 
the same basic structure as the condition prediction problem: a database 
consisting of sequences of events, where each event is associated to an in- 
dividual entity (medical patient, customer, football team). As future events 
unfold in a new sequence, our goal is to predict the next event. 

In Section 2 we provide basic definitions and present our model. In Sec- 
tion 3 we evaluate the predictive performance of HARM, along with several 
baselines through experiments on clinical trial data. Section 4 provides re- 
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lated work, and Section 5 provides a discussion and offers potential exten- 
sions. 

2. Method. This work presents a new approach to association rule min- 
ing by determining the "interestingness" of rules using a particular (hierar- 
chical) Bayesian estimate of the probability of exhibiting condition b, given 
a set of current conditions, a. We will first discuss association rule mining 
and its connection to Bayesian shrinkage estimators. Then we will present 
our hierarchical method for providing personalized condition predictions. 

2.1. Definitions. An association rule in our context is an implication 
a—>b where the left side is a subset of conditions that the patient has expe- 
rienced, and b is a single condition that the patient has not yet experienced 
since the last encounter. Ultimately, we would like to rank rules in terms of 
"interestingness" or relevance for a particular patient at a given time. Using 
this ranking, we make predictions of subsequent conditions. Two common 
determining factors of the "interestingness" of a rule are the "confidence" 
and "support" of the rule [Agrawal, Imielihski and Swami (1993); Piatetsky- 
Shapiro (1991)]. 

The confidence of a rule a — > b for a patient is the empirical probability: 

„ „, JN Number of times conditions a and b were experienced 

Conf (a — )> b) := — — — : 

Number of times conditions a were experienced 

:=P(b\a). 

The support of set a is as follows: 

Support(a) := Number of times conditions a were experienced 

oc P(a), 

where P(a) is the empirical proportion of times that conditions a were expe- 
rienced. When a patient has experienced a particular set of conditions only 
a few times, a new single observation can dramatically alter the confidence 
P(b\a) for many rules. This problem occurs commonly in our clinical trial 
data, where most patients have reported fewer than 10 total conditions. The 
vast majority of rule mining algorithms address this issue with a minimum 
support threshold to exclude rare rules, and the remaining rules are evalu- 
ated for interestingness [reviews of interestingness measures include those of 
Tan, Kumar and Srivastava (2002); Geng and Hamilton (2007)]. The defini- 
tion of interestingness is often heuristic, and is not necessarily a meaningful 
estimate of P{b\a). 

It is well known that problems arise from using a minimum support 
threshold. For instance, consider the collection of rules meeting the min- 
imum support threshold condition. Within this collection, the confidence 
alone should not be used to rank rules: among rules with similar confi- 
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dence, the rules with larger support should be preferred. More importantly, 
"nuggets," which are rules with low support but very high confidence, are 
often excluded by the threshold. This is problematic, for instance, when 
a condition that occurs rarely is strongly linked with another rare condi- 
tion; it is essential not to exclude the rules characterizing these conditions. 
In our data, the distribution of conditions has a long tail, where the vast 
majority of events happen rarely: out of 1800 possible conditions, 1400 occur 
less than 10 times. These 1400 conditions are precisely the ones in danger 
of being excluded by a minimum support threshold. 

Our work avoids problems with the minimum support threshold by rank- 
ing rules with a shrinkage estimator of P(b\a). These estimators directly 
incorporate the support of the rule. One example of such an estimator is the 
"adjusted confidence" [Rudin et al. (2011a, 2011b)]: 



The effect of the penalty term K is to pull low-support rules toward the 
bottom of the list; any rule achieving a high adjusted confidence must over- 
come this pull through either a high enough support or a high confidence. 
Using the adjusted confidence avoids the problems discussed earlier: "in- 
terestingness" is closely related to the conditional probability P(b\a), and, 
among rules with equal confidence, the higher support rules are preferred, 
and there is no strict minimum support threshold. 

In this work we extend the adjusted confidence model in an important re- 
spect, in that our method shares information across similar patients to better 
estimate the conditional probabilities. The adjusted confidence is a particu- 
lar Bayesian estimate of the confidence. Assuming a Beta prior distribution 
for the confidence, the posterior mean is 



where j^x is the support of condition x, and a and f3 denote the parameters 
of the (conjugate) Beta prior distribution. Our model allows the parameters 
of the Binomial to be chosen differently for each patient and also for each 
rule. This means that our model can determine, for instance, whether a par- 
ticular patient is more likely to repeat a condition that has occurred only 
once, and also whether a particular condition is more likely to repeat than 
another. 

We note that our approach makes no explicit attempt to infer causal re- 
lationships between conditions. The observed associations may in fact arise 
from common prior causes such as other conditions or drugs. Thus, a rule 
such as dyspepsia — > heartburn does not necessarily imply that successful 
treatment of dyspepsia will change the probability of heartburn. Rather, the 
goal is to accurately predict heartburn in order to facilitate effective med- 



AdjConf (a : 



Number of times conditions a and b were experienced 



Number of times conditions a were experienced + K 




a + #{akb) 
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ical management. The article of Shmueli (2010) contains a more complete 
discussion of this distinction. 

2.2. Hierarchical association rule model (HARM). For a patient i and 
a given rule, r, say, we observe yi r co-occurrences (number of times lhs and 
rhs were experienced) , where there were a total of rii r encounters that include 
the lhs (rii r is the support for lhs). We model the number of co-occurrences 
as Binomial (nj r ,pj r ) and then model pi r hierarchically to share information 
across groups of similar individuals. Define Masa/xfl matrix of static 
observable characteristics for a total of / individuals and D observable char- 
acteristics, where we assume D > 1 (otherwise we revert back to a model 
with a rule-wise adjustment). Each row of M corresponds to a patient and 
each column to a particular characteristic. We define the columns of M to 
be indicators of particular patient categories (gender, or age between 30 and 
40, e.g.), though they could be continuous in other applications. Let Mj 
denote the ith. row of the matrix M. We model the probability for the ith 
individual and the rth rule pi r as coming from a Beta distribution with 
parameters 7Tj r and Tj. We then define 7Tj r through the regression model 
■k^ = exp(M^/3 r + 7i), where (3 r defines a vector of regression coefficients 
for rule r and ji is an individual-specific random effect. More formally, we 
propose the following model: 

y ir ~ Binomial (n ir ,p ir ), 

Pir ~ Beta(7T ir ,Ti), 

7r ir = exp(M' i /3 r . + 7i). 

Under this model, 

jp/ I \ Uir ~l~ ^ir 

^ \Pir | Uir i i^ir I — ~ J j 
riir T~ Kir ~r 1~i 

which is a more flexible form of adjusted confidence. This expectation also 
produces nonzero probabilities for a rule even if rii r is zero (patient i has 
never reported the conditions on the left-hand side of r before). This could 
allow rules to be ranked more highly even if nj r is zero. The fixed effect 
regression component, M'-/3 r , adjusts 7Tj r based on the patient characteristics 
in the M matrix. For example, if the entries of M represented only gender, 
then the regression model with intercept (3 r ,o would be j3 r fl + /3 ri il m aie> where 
3-maie is one for male respondents and zero for females. Being male, therefore, 
has a multiplicative effect of e^ 1 "- 1 on 7Tj r . In this example, the M^/3 r value 
is the same for all males, encouraging similar individuals to have similar 
values of 7Tj r . For each rule r, we will use a common prior on all coefficients 
in (3 r ; this imposes a hierarchical structure, and has the effect of regularizing 
coefficients associated with rare characteristics. 

The 7Tj r 's allow rare but important "nuggets" to be recommended. Even 
across multiple patient encounters, many conditions occur very infrequently. 
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In some cases these conditions may still be highly associated with certain 
other conditions. For instance, compared to some conditions, migraines are 
relatively rare. Patients who have migraines, however, typically also experi- 
ence nausea. A minimum support threshold algorithm might easily exclude 
the rule "migraines —¥ nausea" if a patient hasn't experienced many mi- 
graines in the past. This is especially likely for patients who have few en- 
counters. In our model, the 7Tj r term balances the regularization imposed 
by Ti to, for certain individuals, increase the ranking of rules with high con- 
fidence but low support. The Tj term reduces the probability associated with 
rules that have appeared few times in the data (low support), with the same 
effect as the penalty term (K) in the adjusted confidence. Unlike the cross- 
validation or heuristic strategies suggested in Rudin et al. (2011a, 2011b), 
we estimate n as part of an underlying statistical model. Within a given 
rule, we assume Tj for every individual comes from the same distribution. 
This imposes additional structure across individuals, increasing stability for 
individuals with few observations. 

It remains now to describe the precise prior structure on the regression 
parameters and hyperparameters. We assign Gaussian priors with mean 
and variance a^. to the r on the log scale. Since any given patient is unlikely 
to experience a specific medical condition, the majority of probabilities are 
close to zero. Giving n a prior with mean zero improves stability by discour- 
aging excessive penalties. We assign all elements j3 r ^ of vectors /3 r a com- 
mon Gaussian prior on the log scale with mean \ip and variance cr|. We also 
assume each ji comes from a Gaussian distribution on the log scale with 
common mean /u 7 and variance <r^. Each individual has their own 7, term, 
which permits flexibility among individuals; however, all of the ji terms 
come from the same distribution, which induces dependence between indi- 
viduals. We assume diffuse uniform priors on the hyperparameters o^, [ip 
and a%. Denote Y as the matrix of yi r values, N as the matrix of ni r val- 
ues, and (3 as the collection of /3 l5 . . . , (3 R . The prior assumptions yield the 
following posterior: 

/ R 

p, tt, r, 0\ Y, N, M cx J] J] p^ +%ir (l - p^-Vir+n 

i=l r=l 
R D 

x n ri Normai ( iog (^)i^'^) 

r=ld=l 
I 

x Yl Normal (log (7* ) | /i 7 , a* ) Normal (log (n )\0,a*). 
i=i 

HARM produces draws from the (approximate) posterior distribution for 
each pi r . Since these terms will be used for ranking the rules, we refer to 
them as rescaled risk. We also note that, even though the Pi r 7 s represent 
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r-Tl I I I I I I I 
I 1 1 1 1 1 1 1 

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 

rescaled risk 

Fig. 1. Approximate posterior of rescaled risk for two rules. These are histograms of the 
posterior means for the set of patients. 

probabilities in our model, they are not interpretable as the probability that 
a patient will have a given condition at the next visit to a provider (since our 
model is not calibrated to time between visits). Figure 1 shows estimates 
of the posterior rescaled risk for high cholesterol —¥ myocardial infarction 
and hypertension —?■ myocardial infarction. Comparing the distributions of 
related rules can often provide insights into associations in the data, as we 
demonstrate in Section 3. 2. In the context of medical condition prediction, 
these rescaled risks are of interest and we analyze our estimates of their 
full posterior distributions in Section 3.2. To rank association rules for the 
purpose of prediction, however, we need a single estimate for each probability 
(rather than a full distribution), which we chose as the posterior mean. In 
practice, we suggest evaluating the mean as well other estimators for each 
rescaled risk (the mode or median, e.g.) and selecting the one with the best 
performance in each particular application. We carry out our computations 
using a Gibbs sampling algorithm, provided in Figure 2. 

2.3. Approximate updating. Given a batch of data, HARM makes pre- 
dictions based on the posterior distributions of the Pi r J s. Since the posteriors 
are not available in closed form, we need to iterate the algorithm in Figure 2 
to convergence in order to make predictions. Each time the patient visits 
the physician, each pi r could be updated by again iterating the algorithm in 
Figure 2 to convergence. In some applications, new data continue to arrive 
frequently, making it impractical to compute approximate posterior distri- 
butions using the algorithm in Figure 2 for each new encounter. In this 
section we provide an approximate updating scheme to incorporate new pa- 
tient data after an initial batch of encounters has already been processed. 
The approximate scheme can be used for real-time online updating. 

Beginning with an initial batch of data, we run the algorithm in Fig- 
ure 2 to convergence in order to obtain fj and 7Tj r , which are defined to 
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For a suitably initialized chain, at step v: 

1. Update pi r from the conjugate Beta distribution given 7Tj r ,Tj, Y,N,M. 

2. Update n using a Metropolis step with proposal r* where 

log(r*) ~ N(rj 1 , (scale of jumping dist)). 

3. For each rule, update the vector j3 r using a Metropolis step with 

log(/3*) ~ N^ -1 ), (scale of jumping dist)). 

4. Update ji using a Metropolis step with 

log(7*) ~ N(7> ,) , (scale of jumping dist)). 

5. Update 7Tj r = exp(M^/3 r + 7$). 

6. Update /ig ~ N(/i / g, a?) where 

\ / r =l <f=l 

7. Update <x^ ~ Inv— x 2 (^ ~~ 1> ^/j) where 

V 7 r=ld=l 

8. Update a\ ~ Inv-x 2 (^ - 1,<5" 2 ) where o\ = j^- S[=i( r i _ /^r) 2 - 

9. Update // 7 ~ N(/i 7 , cr 7 ) where /2 7 = y X^i=i 7t- 

10. Update cr^ ~ Inv- x 2 (I ~ 1, o" 2 ) where = -JL- J][=i(7i ~ /" 7 ) 2 - 

Fig. 2. Gibbs sampling algorithm for the hierarchical Bayesian association rule modeling 
for sequential event prediction (HARM). 



be the posterior means of the estimated distributions for n and 7Tj r . The 
approximate updating scheme keeps Tj and 7Tj r fixed to be fj and 7Tj r . Given 
that up to encounter e — 1 we have observed y^ ^ and nf r ^ , we are pre- 
sented with new observations that have counts y^ ewobs '^ and n (" Gwobs -) so 

(e) (e— 1) . (newobs.) -, (e) (e— 1) . (newobs.) T n 

that y\ r ' = y\ r + y\ r and n\ r ' = n\ r + n\ r . In order to up- 

(e) (e) 

date the probability estimates to reflect our total current data, y\ r , n\ r , we 
will use the following relationship: 

P(v \v {e) n {e) T- 7T- ) OC P(i/ ncwobs ') | n ( ncwobs ') D . 1 
\Pir\yir i n i r ■> 1 1, "ir ) ^ r \ili r V^ir i fir ) 

X P {Pir \yt > n ir _1) ' f i ' ■ 
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The expression P(pi T \y£. l \nf r 1 \fj,7Tj r .) is the posterior up to encounter 
e — 1 and has a Beta distribution. The likelihood of the new observations, 

P(yir CW ° bS \n[^ ewohs '\pir), is Binomial. Conjugacy implies that the updated 
posterior also has a Beta distribution. In order to update the probability 
estimates for our hierarchical model, we use the expectation of this distri- 
bution, that is, 

,,( e_1 ) _|_ ? ,ncwobs. i - 

E(v \v {e) n (e) t- 7T- ^ - Vir 



n% 1] + n f r ewobs - + ir ir + n 



3. Application to repeated patient encounters. We present results of 
HARM, with the approximate updating scheme in Section 2.3, on co-prescrib- 
ing data from a large clinical trial. In the trial, each patient visits a health- 
care provider periodically. At each encounter, the provider records time- 
stamped medical conditions (represented by MedDRA terms) experienced 
since the previous encounter. Thus, each encounter is associated with a se- 
quence of medical conditions. These data are from around 42,000 patient 
encounters from about 2,300 patients, all at least 40 years old. The matrix of 
observable characteristics encodes the basic demographic information: gen- 
der, age, and ethnicity. For each patient we have a record of each medication 
prescribed and the condition/chief complaint (back pain, asthma, etc.) that 
warranted the prescription. We chose to predict patient complaints rather 
than prescriptions since there are often multiple prescribing options (medica- 
tions) for the same complaint. Some patients had preexisting conditions that 
continued throughout the trial. For these patients, we include these preex- 
isting conditions in the patient's list of conditions at each encounter. Other 
patients have recurrent conditions for which we would like to predict the 
occurrences. If a patient reports the same condition more than once during 
the same thirty day period, we only consider the first occurrence of the con- 
dition at the first report. If the patient reports the condition once and then 
again more than thirty days later, we consider this two separate incidents. 

As covariates, we used age, gender, race and drug/placebo (an indicator 
of whether the patient was in the treatment or control group for the clinical 
trial). We fit age using a series of indicator variables corresponding to four 
groups (40-49, 50-59, 60-69, 70+). We included all available covariates in 
our simulation studies. In practice, model selection will likely be essential to 
select the best subset of covariates for predictive performance. We discuss 
covariate selection in further detail in the supplemental article [McCormick, 
Rudin and Madigan (2011)]. 

Our experiments consider only the marginal probabilities (support) and 
probabilities conditional on one previous condition. Thus, the left-hand side 
of each rule contains either items or 1 item. In our simulations, we used 
chains of 5000 iterations, keeping every 10th iteration to compute the mean 
we used for ranking and discarding the first thousand iterations. 
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In Section 3.1 we present experimental results to compare the predictive 
performance of our model to other rule mining algorithms for this type of 
problem. In Section 3.2 we use the probability estimates from the model 
to demonstrate its ability to find new associations; in particular, we find 
associations that are present in the medical literature but that may not be 
obvious by considering only the raw data. 

3.1. Predictive performance. We selected a sample of patients by assign- 
ing each patient a random draw from a Bernoulli distribution with success 
probability selected to give a sample of patients on average around 200. For 
each patient we drew uniformly an integer t{ between and the number of 
encounters for that patient. We ordered the encounters chronologically and 
used encounters 1 through ti as our training set and the remaining encoun- 
ters as the test set. Through this approach, the training set encompasses 
the complete set of encounters for some patients ( "fully observed" ) , includes 
no encounters for others ("new patients"), and a partial encounter history 
of the majority of the test patients ("partially-observed patients"). We be- 
lieve this to be a reasonable approximation of the context where this type 
of method would be applied, with some patients having already been ob- 
served several times and other new patients entering the system for the first 
time. We evaluated HARM's predictive performance using a combination of 
common and rare conditions. For each run of the simulation, we use the 25 
most popular conditions, then randomly select an additional 25 conditions 
for a total of 50. 

The algorithm was used to iteratively predict the conditions revealed at 
each encounter. For each selected patient, starting with the first test en- 
counter, and prior to that encounter's first condition being revealed, the 
algorithm made a prediction of c possible conditions, where c = 3. Note that 
to predict the very first condition for a given patient when there are no 
previous encounters, the recommendations come from posterior means of 
the coefficients estimated from the training set. The algorithm earned one 
point if it recommended the current condition before it was revealed, and no 
points otherwise. Then, yi r and rii r were updated to include the revealed con- 
dition. This process was repeated for the patient's remaining conditions in 
the first encounter, and repeated for each condition within each subsequent 
encounter. We then moved to the next patient and repeated the procedure. 

The total score of the algorithm for a given patient was computed as 
the total number of points earned for that patient divided by the total 
number of conditions experienced by the patient. The total score of the 
algorithm is the average of the scores for the individual patients. Thus, the 
total score is the average proportion of correct predictions per patient. We 
repeated this entire process (beginning with selecting patients) 500 times and 
recorded the distribution over the 500 scores. We compared the performance 
of HARM (using the same scoring system) against an algorithm that ranks 
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Fig. 3. Predictive performance for (a,) all patients, (h) partially-observed patients, 
(c) new patients. Each boxplot represents the distribution of scores over 500 runs. For (a), 
each run's score (an individual point on a boxplot) is based on a sample of approximately 
200 patients. For (b) and (c), each point is based on a subset of these ^200 patients. 



rules by adjusted confidence, for several values of K. We also compared with 
the "max confidence minimum support threshold" algorithm for different 
values of the support threshold 9, where rules with support below 9 are 
excluded and the remaining rules are ranked by confidence. For both of 
these algorithms, no information across patients is able to be used. 

Figure 3 shows the results, as boxplots of the distribution of scores for 
the entire collection of partially-observed, fully observed and new patients. 
Paired t-tests comparing the mean proportion of correct predictions from 
HARM to each of the alternatives had p- values for a significant difference in 
our favor less than 10 -15 . In other words, HARM has statistically superior 
performance over all K and 9, that is, better performance than either of 
the two algorithms even if their parameters K and 9 had been tuned to 
the best possible value. For all four values of K for the adjusted confidence, 
performance was slightly better than for the plain confidence (K = 0). The 
"max confidence minimum support threshold" algorithm (which is a stan- 
dard approach to association rule mining problems) performed poorly for 
minimum support thresholds of 2 and 3. This poor performance is likely 
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HC Hy HC Hy HC Hy HC Hy HC Hy HC Hy 

Caucasian Afr. Amer./Hisp. Asian Caucasian Afr. Amer./Hisp. Asian 



(a) (b) 

Fig. 4. Propensity of myocardial infarction inpatients who have reported high cholesterol 
or hypertension using (&) HARM and (b) (unadjusted) confidence. For each demographic 
group, high cholesterol (HC) is on the left and hypertension (Hy) is on the right. Thick 
lines represent the middle half of the posterior mean propensities for respondents in the 
indicated demographic group. Outer lines represent the middle 90% and dots represent 
the mean. The vast majority of patients did not experience a myocardial infarction, which 
places the middle 90% of the distributions in plot (b ) approximately at zero. 



due to the sparse information we have for each patient. Setting a minimum 
support threshold as low as even two eliminates many potential candidate 
rules from consideration. 

The main advantage of our model is that it shares information across 
patients in the training set. This means that in early stages where the ob- 
served i/ir and small, it may still be possible to obtain reasonably ac- 
curate probability estimates, since when patients are new, our recommenda- 
tions depend heavily on the behavior of previously observed similar patients. 
This advantage is shown explicitly through Figures 3(b) and 3(c), which per- 
tain to partially-observed and new patients, respectively. The advantage of 
HARM over the other methods is more pronounced for new patients: in 
cases where there are no data for each patient, there is a large advantage 
to sharing information. We performed additional simulations which further 
illustrate this point and are presented in the supplement [McCormick, Rudin 
and Madigan (2011)]. 

3.2. Association mining. The conditional probability estimates from our 
model are also a way of mining a large and highly dependent set of associa- 
tions. 

Ethnicity, high cholesterol or hypertension — > myocardial infarction: Fig- 
ure 4(a) shows the distribution of posterior mean propensity for myocar- 
dial infarction (heart attack) given two conditions previously reported as 
risk factors for myocardial infarction: high cholesterol and hypertension [see 
Kukline, Yoon and Keenan (2010) for a recent review]. Each bar in the figure 
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corresponds to the set of respondents in a specified ethnic group. For Cau- 
casians, we typically estimate a higher probability of myocardial infarction 
in patients who have previously had high cholesterol. In African Ameri- 
cans/Hispanics and Asian patients, however, we estimate a generally higher 
probability for patients who have reported hypertension. This distinction 
demonstrates the flexibility of our method in combining information across 
respondents who are observably similar. Race-ethnic differences in risk fac- 
tors for coronary heart disease have attracted considerable attention in the 
medical literature [see, e.g., Rosamond et al. (2007) or Willey et al. (2011)]. 

As a comparison, we also included the same plot using (unadjusted) con- 
fidence, in Figure 4(b). In both Figure 4(a) and Figure 4(b), the black dots 
are the mean across all the patients, which are not uniformly at zero be- 
cause there were some cases of myocardial infarction and hypertension or 
high cholesterol. In Figure 4(b), the colored, smaller dots represent the rest 
of the middle 90% of the distribution, which appears to be at zero in plot (b) 
since the vast majority of patients did not have a myocardial infarction at 
all, so even fewer had a myocardial infarction after reporting hypertension 
or high cholesterol. 

Age, high cholesterol or hypertension, treatment or placebo — > myocar- 
dial infarction: Since our data come from a clinical trial, we also included 
an indicator of treatment vs. placebo condition in the hierarchical regression 
component of HARM. Figures 5 and 6 display the posterior means of propen- 
sity of myocardial infarction for respondents separated by age and treatment 
condition. Figure 5 considers patients who have reported hypertension, Fig- 
ure 6 considers patients who have reported high cholesterol. In both Figure 
5 and Figure 6, it appears that the propensity of myocardial infarction pre- 
dicted by HARM is greatest for individuals between 50 and 70, with the 
association again being stronger for high cholesterol than hypertension. 

For both individuals with either high cholesterol or hypertension, use 
of the treatment medication was associated with increased propensity of 
myocardial infarction. This effect is present across nearly every age category. 
The distinction is perhaps most clear among patients in their fifties in both 
Figure 5 and Figure 6. The treatment product in this trial has been linked 
to increased risk of myocardial infarction in numerous other studies. The 
product was eventually withdrawn from the market by the manufacturer 
because of its association with myocardial infarctions. 

The structure imposed by our hierarchical model gives positive probabil- 
ities even when no data are present in a given category; in several of the 
categories, we observed no instances of a myocardial infarction, so estimates 
using only the data cannot differentiate between the categories in terms of 
risk for myocardial infarction, as particularly illustrated through Figure 6(b). 

4. Related works. Four relevant works on Bayesian hierarchical model- 
ing and recommender systems are those of DuMouchel and Pregibon (2001), 
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Fig. 5. Propensity of myocardial infarction in patients who have reported hypertension, 
estimated by (a.) HARM and (b) (unadjusted) confidence. For each demographic group, 
the placebo (P) is on the left and the treatment medication (T) is on the right. Thick 
lines represent the middle half of the posterior mean propensities for respondents in the 
indicated demographic group. Outer lines represent the middle 90% and dots represent the 
mean. Overall the propensity is higher for individuals who take the study medication than 
those who do not. 



Breese, Heckerman and Kadie (1998), Condliff et al. (1999) and Agarwal, 
Zhang and Mazumder (2012). DnMouchel and Pregibon (2001) deal with 
the identification of interesting itemsets (rather than identification of rules) . 
Specifically, they model the ratio of observed itemset frequencies to base- 
line frequencies computed under a particular model for independence. Nei- 
ther Condliff et al. (1999) nor Breese, Heckerman and Kadie (1998) aim 
to model repeat purchases (recurring conditions). Breese, Heckerman and 
Kadie (1998) use Bayesian methods to cluster users, and also suggest a Bayes- 
ian network. Condliff et al. (1999) present a hierarchical Bayesian approach 
to collaborative filtering that "borrows strength" across users. Agarwal, 
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Fig. 6. Propensity of myocardial infarction in patients who have reported high cholesterol, 
estimated by (a,) HARM and (h) (unadjusted) confidence. 
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Zhang and Mazumder (2012) also build a personalized recommender sys- 
tem that models item-item similarities. Their model uses logistic regression 
for estimating pi r rather than using 7Tj r and Tj. This has the advantage of 
being a simpler hierarchical model, but loses the interpretability our model 
has through using association rules. It also loses the potential advantage of 
estimating only conditional probabilities involving few variables. 

As far as we know, the line of work by Davis et al. (2010) is the first to 
use an approach from recommender systems to predict medical conditions, 
though in a completely different way than ours; it is based on vector sim- 
ilarity, in the same way as Breese, Heckerman and Kadie (1998). [Also see 
references in Davis et al. (2010) for background on collaborative filtering.] 
Also in the context of learning in medical problems, Gopalakrishnan et al. 
(2010) used decision rules chosen using a Bayesian network to predict disease 
state from biomarker profiling studies. 

5. Conclusion and future work. We have presented a hierarchical model 
for ranking association rules for sequential event prediction. The sequential 
nature of the data is captured through rules that are sensitive to time order, 
that is, a —> b indicates conditions a are followed by conditions b. HARM 
uses information from observably similar individuals to augment the (often 
sparse) data on a particular individual; this is how HARM is able to estimate 
probabilities P(b\a) before conditions a have ever been reported. In the 
absence of data, hierarchical modeling provides structure. As more data 
become available, the influence of the modeling choices fade as greater weight 
is placed on the data. The sequential prediction approach is especially well 
suited to medical condition prediction, where experiencing two conditions in 
succession may have different clinical implications than experiencing either 
condition in isolation. 

Model selection is important for using our method in practice. There are 
two types of model selection required for HARM: the choice of covariates 
encoded by the matrix M, and the collection of available rules. For the 
choice of covariates in M, standard feature selection methods can be used, 
for instance, a forward stagewise procedure where one covariate at a time 
is added as performance improves, or a backward stagewise method where 
features are iteratively removed. Another possibility is to combine covari- 
ates, potentially through a method similar to model-based clustering [Fraley 
and Raftery (2002)]. To perform model selection on the choice of rules, it is 
possible to construct analogous "rule selection" methods as one might use 
for a set of covariates. A forward stagewise procedure could be constructed, 
where the set of rules is gradually expanded as prediction performance in- 
creases. Further, it is possible to combine a set of rules into a single rule 
as in model-based clustering; for example, rather than separate rules where 
the left side is either "dorsal pain," "back pain," "low back pain," or "neck 
pain," we could use simply "back or neck pain" for all of them. 
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Another direction for future work is to incorporate a model for higher- 
order dependence, along the line of work by Berchtold and Raftery (2002). 
An algorithm for sequential event prediction is presented in ongoing work 
[Letham, Rudin and Madigan (2011)], which is loosely inspired by the ideas 
of Berchtold and Raftery (2002), but does not depend on association rules. 
A third potential future direction is to design a more sophisticated online 
updating procedure than the one in Section 2.3. It may be possible to design 
a procedure that approximately updates all of the hyperparameters as more 
data arrive. 

Acknowledgments. We would like to thank the Editor, Associate Editor 
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SUPPLEMENTARY MATERIAL 

Additional simulation results (DOI: 10.1214/11-AOAS522SUPP; .pdf). 
In the supplement we present additional simulation results which speak to 
the performance of HARM. 
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